Please sign in on etherpad: https://pad.carpentries.org/2020-09-10-06-r2
download.file("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder_data.csv", destfile = "data/gapminder_data.csv")
gapminder <- read.csv("data/gapminder_data.csv")
gapminder <- read.csv("https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder_data.csv")
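One thing to watch: download.file() writes into data/, so that folder has to exist before the call. A minimal sketch of checking for the folder and doing a CSV round trip, using a temporary folder and a tiny made-up table (data_dir, csv_path and toy are illustration names, not part of the lesson):

```r
# Work in a temporary folder so this sketch does not touch your project
data_dir <- file.path(tempdir(), "data")
if (!dir.exists(data_dir)) {
  dir.create(data_dir)          # create the folder only if it is missing
}
dir.exists(data_dir)

# Round trip with a tiny made-up table: write a CSV, then read it back
csv_path <- file.path(data_dir, "toy.csv")
write.csv(data.frame(country = "A", year = 1952L), csv_path, row.names = FALSE)
toy <- read.csv(csv_path)
str(toy)
```

In your own project the same check would be dir.exists("data") before running download.file().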
str() function on any R object:
str(gapminder) # str stands for structure
## 'data.frame': 1704 obs. of 6 variables:
## $ country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: chr "Asia" "Asia" "Asia" "Asia" ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num 779 821 853 836 740 ...
View(gapminder) #to look at it in a tab
R has a few data types it is good to be aware of:
typeof(gapminder$year)
## [1] "integer"
typeof(gapminder$lifeExp)
## [1] "double"
typeof(3.14)
## [1] "double"
typeof(TRUE) # logical
## [1] "logical"
typeof(1L) # The L suffix forces the number to be an integer, since by default R stores numbers as doubles
## [1] "integer"
typeof('banana')
## [1] "character"
class(gapminder)
## [1] "data.frame"
typeof(gapminder$continent)
## [1] "character"
typeof(gapminder$country) #character
## [1] "character"
typeof(gapminder$year)
## [1] "integer"
x <- c(1, 2.4, 3, 5) #what's the <- again?
x
## [1] 1.0 2.4 3.0 5.0
str(x)
## num [1:4] 1 2.4 3 5
typeof(x)
## [1] "double"
Couple of things:
* the c() function is used in R a lot - stands for combine and it will create a vector
* there are other ways to create a vector but we use this a lot.
* what happens if we create a mixed vector?
y = c("dog", 1.4, 3.5, TRUE)
y
## [1] "dog" "1.4" "3.5" "TRUE"
str(y)
## chr [1:4] "dog" "1.4" "3.5" "TRUE"
typeof(y)
## [1] "character"
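Why did everything turn into character? A vector can hold only one type, so c() converts every element to the most flexible type present. A small sketch of that ordering (logical, then integer, then double, then character); the variable names are made up for illustration:

```r
# c() coerces to the most flexible type present:
# logical < integer < double < character
lgl_int <- c(TRUE, 2L)       # TRUE becomes 1L
int_dbl <- c(1L, 2.5)        # 1L becomes 1.0
dbl_chr <- c(1.5, "dog")     # 1.5 becomes "1.5"

typeof(lgl_int)  # "integer"
typeof(int_dbl)  # "double"
typeof(dbl_chr)  # "character"
```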
We can also convert types ourselves with as.numeric, as.logical, as.character:
char_vector_nums <- c('1','2','3')
typeof(as.numeric(char_vector_nums))
## [1] "double"
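Explicit conversion does not always succeed. A short sketch, using made-up strings, of what happens when a value cannot be converted: it becomes NA, and as.numeric() also raises a warning (silenced here with suppressWarnings() so the example runs quietly):

```r
mixed_strings <- c("1", "2.5", "dog")   # hypothetical example data

# "dog" cannot be read as a number, so it becomes NA (R also warns)
nums <- suppressWarnings(as.numeric(mixed_strings))
nums
is.na(nums)

# as.logical() understands strings such as "TRUE" and "FALSE"
lgls <- as.logical(c("TRUE", "FALSE", "dog"))
lgls
```

Checking is.na() after a conversion is a quick way to spot values that did not survive.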
str(gapminder)
## 'data.frame': 1704 obs. of 6 variables:
## $ country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: chr "Asia" "Asia" "Asia" "Asia" ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num 779 821 853 836 740 ...
summary(gapminder$country)
## Length Class Mode
## 1704 character character
?summary
summary(gapminder$country) # tells us the length of the character vector
## Length Class Mode
## 1704 character character
factor(gapminder$country)
summary(factor(gapminder$country))
## Afghanistan Albania Algeria
## 12 12 12
## Angola Argentina Australia
## 12 12 12
## Austria Bahrain Bangladesh
## 12 12 12
## Belgium Benin Bolivia
## 12 12 12
## Bosnia and Herzegovina Botswana Brazil
## 12 12 12
## Bulgaria Burkina Faso Burundi
## 12 12 12
## Cambodia Cameroon Canada
## 12 12 12
## Central African Republic Chad Chile
## 12 12 12
## China Colombia Comoros
## 12 12 12
## Congo Dem. Rep. Congo Rep. Costa Rica
## 12 12 12
## Cote d'Ivoire Croatia Cuba
## 12 12 12
## Czech Republic Denmark Djibouti
## 12 12 12
## Dominican Republic Ecuador Egypt
## 12 12 12
## El Salvador Equatorial Guinea Eritrea
## 12 12 12
## Ethiopia Finland France
## 12 12 12
## Gabon Gambia Germany
## 12 12 12
## Ghana Greece Guatemala
## 12 12 12
## Guinea Guinea-Bissau Haiti
## 12 12 12
## Honduras Hong Kong China Hungary
## 12 12 12
## Iceland India Indonesia
## 12 12 12
## Iran Iraq Ireland
## 12 12 12
## Israel Italy Jamaica
## 12 12 12
## Japan Jordan Kenya
## 12 12 12
## Korea Dem. Rep. Korea Rep. Kuwait
## 12 12 12
## Lebanon Lesotho Liberia
## 12 12 12
## Libya Madagascar Malawi
## 12 12 12
## Malaysia Mali Mauritania
## 12 12 12
## Mauritius Mexico Mongolia
## 12 12 12
## Montenegro Morocco Mozambique
## 12 12 12
## Myanmar Namibia Nepal
## 12 12 12
## Netherlands New Zealand Nicaragua
## 12 12 12
## Niger Nigeria Norway
## 12 12 12
## Oman Pakistan Panama
## 12 12 12
## (Other)
## 516
Now what if we want to permanently save country as a factor?
We might not need to do this in a real case because we can always convert on the fly if needed, but we might need it so we can alter the levels (they could be ordinal).
Let’s add the factor version of country at the end of gapminder as a new column. How do you think we do that?
gapminder$countr_fac <- factor(gapminder$country)
str(gapminder)
## 'data.frame': 1704 obs. of 7 variables:
## $ country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ continent : chr "Asia" "Asia" "Asia" "Asia" ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ gdpPercap : num 779 821 853 836 740 ...
## $ countr_fac: Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
Notice the difference.
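On altering levels: by default factor() sorts levels alphabetically, which is rarely the order you want for ordinal data. A minimal sketch with made-up survey answers (sizes and size_fac are illustration names, not gapminder columns):

```r
# Hypothetical survey answers, not part of gapminder
sizes <- c("small", "large", "medium", "small")

# Without `levels`, factor() sorts the levels alphabetically
levels(factor(sizes))

# Supplying `levels` and ordered = TRUE gives a meaningful order
size_fac <- factor(sizes, levels = c("small", "medium", "large"), ordered = TRUE)
levels(size_fac)
as.integer(size_fac)        # underlying integer codes follow the level order
size_fac < "large"          # comparisons work on ordered factors
```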
Ok let’s look at some helpful methods to inspect the data.
typeof(gapminder$year)
## [1] "integer"
typeof(gapminder$country)
## [1] "character"
str(gapminder$country)
## chr [1:1704] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
str(gapminder$countr_fac)
## Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
length(gapminder)
## [1] 7
typeof(gapminder)
## [1] "list"
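That "list" result explains why length(gapminder) gave 7: a data frame is a list of columns, each a vector of the same length. A small sketch with a made-up data frame (df is an illustration name):

```r
# A tiny made-up data frame
df <- data.frame(id = 1:3, name = c("a", "b", "c"), stringsAsFactors = FALSE)

is.list(df)                 # a data frame is a list under the hood
length(df) == ncol(df)      # length() counts columns, not rows
sapply(df, typeof)          # type of each column
sapply(df, length)          # every column has nrow(df) elements
```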
Let’s explore some other functions we can use to inspect dataframes:
nrow(gapminder)
## [1] 1704
ncol(gapminder)
## [1] 7
dim(gapminder)
## [1] 1704 7
colnames(gapminder)
## [1] "country" "year" "pop" "continent" "lifeExp"
## [6] "gdpPercap" "countr_fac"
head(gapminder)
## country year pop continent lifeExp gdpPercap countr_fac
## 1 Afghanistan 1952 8425333 Asia 28.801 779.4453 Afghanistan
## 2 Afghanistan 1957 9240934 Asia 30.332 820.8530 Afghanistan
## 3 Afghanistan 1962 10267083 Asia 31.997 853.1007 Afghanistan
## 4 Afghanistan 1967 11537966 Asia 34.020 836.1971 Afghanistan
## 5 Afghanistan 1972 13079460 Asia 36.088 739.9811 Afghanistan
## 6 Afghanistan 1977 14880372 Asia 38.438 786.1134 Afghanistan
What’s a column in a data frame, again?
Right, it’s a vector. Let’s pull out a column as a vector by itself and explore how we subset vectors.
life_exp <- gapminder[['lifeExp']]
str(life_exp)
## num [1:1704] 28.8 30.3 32 34 36.1 ...
life_exp[1]
## [1] 28.801
life_exp[c(3,7)]
## [1] 31.997 39.854
Note the c() function inside the brackets above. The colon operator builds a sequence:
1:4
## [1] 1 2 3 4
life_exp[1:4]
## [1] 28.801 30.332 31.997 34.020
life_exp[-1:4]
Why didn’t that work? Yes, -1:4 expands to -1, 0, 1, 2, 3, 4, and R can’t mix negative and positive indices.
-1:4
## [1] -1 0 1 2 3 4
life_exp[-(1:4)]
Also works:
life_exp[-c(1:4)]
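The same behaviour on a vector small enough to check by eye. This is only a sketch: v is made up, and tryCatch() is used so the failing call does not stop the script:

```r
v <- c(10, 20, 30, 40, 50, 60)   # small made-up vector

# -1:4 is (-1):4, which mixes negative and positive indices: an error
result <- tryCatch(v[-1:4], error = function(e) "error")
result

# Parentheses negate the whole sequence, dropping elements 1 to 4
dropped_first_four <- v[-(1:4)]
dropped_first_four

# Negative indices drop elements; they never mean "count from the end"
v[-1]
```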
head(gapminder[3])
## pop
## 1 8425333
## 2 9240934
## 3 10267083
## 4 11537966
## 5 13079460
## 6 14880372
str(gapminder[3])
## 'data.frame': 1704 obs. of 1 variable:
## $ pop: num 8425333 9240934 10267083 11537966 13079460 ...
However, if we use [[3]], it’ll return the column as a vector
head(gapminder[[3]])
## [1] 8425333 9240934 10267083 11537966 13079460 14880372
head(gapminder[["lifeExp"]])
## [1] 28.801 30.332 31.997 34.020 36.088 38.438
str(gapminder[[3]])
## num [1:1704] 8425333 9240934 10267083 11537966 13079460 ...
A vector.
Think about it via this image: [image: peeling an onion]
The $ (dollar sign) operator pulls out a column by name. Names are a lot easier to remember than column numbers.
head(gapminder$year)
## [1] 1952 1957 1962 1967 1972 1977
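To compare the three ways of pulling out a column side by side, here is a short sketch on a made-up data frame (df, a and b are illustration names):

```r
df <- data.frame(a = 1:3, b = c(2.5, 3.5, 4.5))   # made-up data

class(df["b"])      # "data.frame": single brackets keep the container
class(df[["b"]])    # "numeric": double brackets return the vector inside

# $ and [[ ]] return the same vector, by name or by position
identical(df[["b"]], df$b)
identical(df[[2]], df$b)
```

Single brackets are handy when the next step expects a data frame; double brackets or $ when it expects a plain vector, for example mean(df$b).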
We can pull out rows and columns by giving two arguments inside []: rows first, then columns.
gapminder[1:3,] #row 1-3 and all columns
## country year pop continent lifeExp gdpPercap countr_fac
## 1 Afghanistan 1952 8425333 Asia 28.801 779.4453 Afghanistan
## 2 Afghanistan 1957 9240934 Asia 30.332 820.8530 Afghanistan
## 3 Afghanistan 1962 10267083 Asia 31.997 853.1007 Afghanistan
gapminder[3,] # row 3 and all columns
## country year pop continent lifeExp gdpPercap countr_fac
## 3 Afghanistan 1962 10267083 Asia 31.997 853.1007 Afghanistan
gapminder[3:10, 1:3]
## country year pop
## 3 Afghanistan 1962 10267083
## 4 Afghanistan 1967 11537966
## 5 Afghanistan 1972 13079460
## 6 Afghanistan 1977 14880372
## 7 Afghanistan 1982 12881816
## 8 Afghanistan 1987 13867957
## 9 Afghanistan 1992 16317921
## 10 Afghanistan 1997 22227415
Let’s subset gapminder and only include data from 1987
gapminder[gapminder$year == 1987, ]
How about populations of 15,000,000 or more?
gapminder[gapminder$pop >= 15000000,]
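What happens inside the brackets: the comparison produces one TRUE or FALSE per row, and only the TRUE rows are kept. A minimal sketch on made-up rows shaped like gapminder (toy and keep are illustration names), also showing how to combine two conditions with &:

```r
# Made-up rows in the same shape as gapminder
toy <- data.frame(
  country = c("A", "A", "B", "B"),
  year    = c(1987, 1992, 1987, 1992),
  pop     = c(10e6, 12e6, 20e6, 22e6)
)

keep <- toy$year == 1987 & toy$pop >= 15000000   # one TRUE/FALSE per row
keep
toy[keep, ]              # rows where both conditions hold
sum(keep)                # number of matching rows
```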
seAsia <- c("Myanmar","Thailand","Cambodia","Vietnam","Laos")
## extract the `country` column from the data frame
## and keep just the non-repeated elements
countries <- unique(gapminder$country)
One way:
(countries=="Myanmar" | countries=="Thailand" |
countries=="Cambodia" | countries == "Vietnam" | countries=="Laos")
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [85] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
A more elegant and better way:
countries %in% seAsia
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [85] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
How to use with gapminder:
gapminder[gapminder$country %in% seAsia, ]
## country year pop continent lifeExp gdpPercap countr_fac
## 217 Cambodia 1952 4693836 Asia 39.417 368.4693 Cambodia
## 218 Cambodia 1957 5322536 Asia 41.366 434.0383 Cambodia
## 219 Cambodia 1962 6083619 Asia 43.415 496.9136 Cambodia
## 220 Cambodia 1967 6960067 Asia 45.415 523.4323 Cambodia
## 221 Cambodia 1972 7450606 Asia 40.317 421.6240 Cambodia
## 222 Cambodia 1977 6978607 Asia 31.220 524.9722 Cambodia
## 223 Cambodia 1982 7272485 Asia 50.957 624.4755 Cambodia
## 224 Cambodia 1987 8371791 Asia 53.914 683.8956 Cambodia
## 225 Cambodia 1992 10150094 Asia 55.803 682.3032 Cambodia
## 226 Cambodia 1997 11782962 Asia 56.534 734.2852 Cambodia
## 227 Cambodia 2002 12926707 Asia 56.752 896.2260 Cambodia
## 228 Cambodia 2007 14131858 Asia 59.723 1713.7787 Cambodia
## 1045 Myanmar 1952 20092996 Asia 36.319 331.0000 Myanmar
## 1046 Myanmar 1957 21731844 Asia 41.905 350.0000 Myanmar
## 1047 Myanmar 1962 23634436 Asia 45.108 388.0000 Myanmar
## 1048 Myanmar 1967 25870271 Asia 49.379 349.0000 Myanmar
## 1049 Myanmar 1972 28466390 Asia 53.070 357.0000 Myanmar
## 1050 Myanmar 1977 31528087 Asia 56.059 371.0000 Myanmar
## 1051 Myanmar 1982 34680442 Asia 58.056 424.0000 Myanmar
## 1052 Myanmar 1987 38028578 Asia 58.339 385.0000 Myanmar
## 1053 Myanmar 1992 40546538 Asia 59.320 347.0000 Myanmar
## 1054 Myanmar 1997 43247867 Asia 60.328 415.0000 Myanmar
## 1055 Myanmar 2002 45598081 Asia 59.908 611.0000 Myanmar
## 1056 Myanmar 2007 47761980 Asia 62.069 944.0000 Myanmar
## 1525 Thailand 1952 21289402 Asia 50.848 757.7974 Thailand
## 1526 Thailand 1957 25041917 Asia 53.630 793.5774 Thailand
## 1527 Thailand 1962 29263397 Asia 56.061 1002.1992 Thailand
## 1528 Thailand 1967 34024249 Asia 58.285 1295.4607 Thailand
## 1529 Thailand 1972 39276153 Asia 60.405 1524.3589 Thailand
## 1530 Thailand 1977 44148285 Asia 62.494 1961.2246 Thailand
## 1531 Thailand 1982 48827160 Asia 64.597 2393.2198 Thailand
## 1532 Thailand 1987 52910342 Asia 66.084 2982.6538 Thailand
## 1533 Thailand 1992 56667095 Asia 67.298 4616.8965 Thailand
## 1534 Thailand 1997 60216677 Asia 67.521 5852.6255 Thailand
## 1535 Thailand 2002 62806748 Asia 68.564 5913.1875 Thailand
## 1536 Thailand 2007 65068149 Asia 70.616 7458.3963 Thailand
## 1645 Vietnam 1952 26246839 Asia 40.412 605.0665 Vietnam
## 1646 Vietnam 1957 28998543 Asia 42.887 676.2854 Vietnam
## 1647 Vietnam 1962 33796140 Asia 45.363 772.0492 Vietnam
## 1648 Vietnam 1967 39463910 Asia 47.838 637.1233 Vietnam
## 1649 Vietnam 1972 44655014 Asia 50.254 699.5016 Vietnam
## 1650 Vietnam 1977 50533506 Asia 55.764 713.5371 Vietnam
## 1651 Vietnam 1982 56142181 Asia 58.816 707.2358 Vietnam
## 1652 Vietnam 1987 62826491 Asia 62.820 820.7994 Vietnam
## 1653 Vietnam 1992 69940728 Asia 67.662 989.0231 Vietnam
## 1654 Vietnam 1997 76048996 Asia 70.672 1385.8968 Vietnam
## 1655 Vietnam 2002 80908147 Asia 73.017 1764.4567 Vietnam
## 1656 Vietnam 2007 85262356 Asia 74.249 2441.5764 Vietnam
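Notice that the result has no rows for Laos: %in% quietly ignores names that have no match. A small sketch for spotting such names, using made-up vectors (wanted and available are illustration names; available stands in for unique(gapminder$country)):

```r
wanted    <- c("Myanmar", "Thailand", "Laos")       # names we ask for
available <- c("Cambodia", "Myanmar", "Thailand")   # made-up stand-in for the country list

available %in% wanted                # which available names were requested

# Requested names with no match in the data
missing_names <- wanted[!(wanted %in% available)]
missing_names
setdiff(wanted, available)           # same answer with a set operation
```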
Instructions: We will group you into rooms. Select one person to drive the computer; the others will give instructions on how to solve the problems. The driver shares their screen. Switch halfway through for fun.
https://docs.google.com/document/d/1TrX2BVMB0VpMTYvA--nXj6joy8cRDZz5Zy8zhHHu9-I/edit#
Base R has plot(), but we will use ggplot2. You may already have it installed, but if not:
install.packages("ggplot2")
library("ggplot2")
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point()
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.5) + scale_x_log10()
# group = country draws one line per country
ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country, color=continent)) +
  geom_line()
Make the points stand out:
# color only the lines, so the points stay black
ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country)) +
  geom_line(mapping = aes(color=continent)) + geom_point()
# for comparison: lines and points both colored by continent
ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country, color=continent)) +
  geom_line() + geom_point()
americas <- gapminder[gapminder$continent == "Americas",] #subset
ggplot(data = americas, mapping = aes(x = year, y = lifeExp)) +
  geom_line() +
  facet_wrap( ~ country) +
  theme(axis.text.x = element_text(angle = 45))
You’ll get hands-on tomorrow with Stephanie. Thanks everyone!